Abstract

We present an approach for image database retrieval using a very large number of 
highly-selective features and simple on-line learning. Our approach is predicated on 
the assumption that each image is generated by a sparse set of visual "causes" and that 
images which are visually similar share causes. We propose a mechanism for generating
a large number of complex features which capture some aspects of this causal
structure. Boosting is used to learn simple and efficient classifiers in this complex feature
space. Finally we will describe a practical implementation of our retrieval system
on a database of 3000 images. 

1 Introduction

Image database retrieval can be viewed as a particular case of information retrieval (IR) where the 
task is: "Given a few example images, learn to retrieve other examples of that class from a very large 
database" [9, 6, 13, 5, 7, 8, 11].

Retrieval differs from the more typical task of classification in that the number of potential 
image classes is extremely large and not known until query time. Thus traditional machine learning 
methods for classification are difficult to apply since they often require a small number of classes 
and a large set of labeled data $\{x_i, y_i\}$ (where $x_i$ is an input image and $y_i$ is the class label).

Several related issues conspire to make the learning for image retrieval surprisingly difficult:
i) the number of training examples is small (perhaps 4 or 5 images); ii) the database of images is
very large (perhaps 100,000 images); and iii) images belong to multiple classes. From this information
one might conclude that the learning task for image database retrieval is essentially impossible.

An effective solution to this problem hinges on the discovery of a simplifying structure in the 
distribution of images. The distribution of natural images is constrained by the causal structure
which generates them. The sparse causal structure of images is hard to ignore. A photograph chosen
at random from the Web might be the "Eiffel Tower" or the "Taj Mahal". While there are a very large
number of possible objects, each image will contain just a few. 

Classical image database retrieval approaches attempt to classify images based on a small number
of common features, such as the number of vertical edges or the number of bright red pixels.
Since both the Eiffel Tower and the Taj Mahal have vertical edges, these features clearly cut across
the boundaries of the causal structure. Learning the concept of "Eiffel Tower" from example images
using these features will require the learning algorithm to stake out a complex region in this feature 
space. 

Our approach for image database retrieval depends on two related proposals: i) that images are 
best represented using a very large and selective set of features; and ii) that learning a query (image 
class) should quickly focus on just a few of these features. 

In earlier work we proposed a scheme for computing a very large set of "complex features"
[3]. In this paper we will expand upon and further justify this feature set. We will then describe
an efficient mechanism for learning a query based on "Boosting" [4]. The trained classifier generalizes
well, using between 20 and 50 complex features (less than one percent of the total number
of complex features). These approaches are perfectly complementary: the complex feature set represents
rich and subtle distinctions between images, while the boosted learning algorithm produces
classifiers which generalize well and are very efficient.

2 Complex features 

Most image database approaches use a very small set of "simple" features: for example the number 
of pixels of a given color, the number of vertical edges, or the number of horizontal edges. Initial
approaches focused exclusively on the color measures because they are pose insensitive [12].
Unfortunately these measures are also very non-selective. Many images have blue pixels or vertical
edges. A query in such a simple feature space must stake out a very complex and irregular region 
in feature space. Finding such a region is a complex process that requires a lot of data. As a result 
these systems do not attempt to learn the query from the user. Instead users are asked to choose 
weights for the various features based on prior knowledge and intuition. 



Our system differs from others because it detects not only simple first order features, such as 
oriented edges or color, but also measures how these first order features are related to one another. 
Thus by finding patterns between image regions with particular local properties, more complex - and 
therefore more discriminating - features can be extracted. 

The process starts out by first extracting a feature map for each type of simple feature (there 
are 25 simple linear features including "oriented edges", "center surround" and bar filters). Each 
feature map is rectified and downsampled by two. The 25 feature maps are then used as the input to
another round of feature extraction (yielding 625 = 25 x 25 feature maps). The process is repeated 
again to yield 15,625 feature maps. Finally each feature map is summed to yield a single feature 
value. 

More formally the characteristic signature of an image is given by: 

$$S_{i,j,k,c}(I) = \sum_{\text{pixels}} E_{i,j,k}(I_c) \qquad (1)$$

where $I$ is the image, $i$, $j$ and $k$ are indices over the different types of linear filters, and the $I_c$ are
the different color channels of the image. The definition of $E$ is:

$$E_i(I) = 2\downarrow [\operatorname{abs}(F_i \otimes I)] \qquad (2)$$

$$E_{i,j}(I) = 2\downarrow [\operatorname{abs}(F_j \otimes E_i(I))] \qquad (3)$$

$$E_{i,j,k}(I) = 2\downarrow [\operatorname{abs}(F_k \otimes E_{i,j}(I))] \qquad (4)$$

where $F_i$ is the $i$th filter and $2\downarrow$ is the downsampling operation.
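The recursion in Equations 1 to 4 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`filter2d`, `stage`, `signature`), the "valid" boundary handling, the use of cross-correlation for the filtering step, and any filters passed in are all assumptions made for illustration. The paper's 25 filters are not specified here.

```python
def filter2d(image, kernel):
    """Slide a small kernel over a list-of-lists image ("valid" region only).

    This is cross-correlation; flipping the kernel would give convolution.
    The boundary handling is an assumption, not taken from the paper.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def stage(feature_map, kernel):
    """One application of E: filter, rectify with abs, downsample by two."""
    filtered = filter2d(feature_map, kernel)
    rectified = [[abs(v) for v in row] for row in filtered]
    return [row[::2] for row in rectified[::2]]


def signature(channel, filters):
    """Return {(i, j, k): S_ijk} for one color channel, following Eq. 1."""
    sig = {}
    for i, f_i in enumerate(filters):
        e_i = stage(channel, f_i)              # Eq. 2
        for j, f_j in enumerate(filters):
            e_ij = stage(e_i, f_j)             # Eq. 3
            for k, f_k in enumerate(filters):
                e_ijk = stage(e_ij, f_k)       # Eq. 4
                sig[(i, j, k)] = sum(map(sum, e_ijk))  # sum over pixels
    return sig
```

With $n$ filters the dictionary holds $n^3$ values per color channel, which matches the 25, 625, 15,625 progression described above when $n = 25$. The image must be large enough to survive three rounds of filtering and downsampling.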

We conjecture that these features do in fact reflect some of the sparse causal structure of the 
image formation process. One piece of evidence which supports this conclusion is the statistical 
distribution of the complex feature values. Evaluated across an image database containing 3000 
images, these features are distinctly non-gaussian. The average kurtosis is approximately 8 and some 
of the features have a kurtosis as high as 120 (the gaussian has a kurtosis of 3). Observing this type 
of distribution in a filter is extremely unusual and hence highly meaningful. Since the distribution 
of the pixels is sub-gaussian, a random combination of pixel values would yield an approximately
gaussian distribution (see footnote 1). The discovery of features with these types of properties is highly unlikely.
Recall that the final step in the feature computation is summing the pixels in each feature map. 
Given that the sum of independent variables tends toward a gaussian very quickly, the high kurtosis 
of the feature values is even more surprising. The kurtosis of the feature map pixels is much greater 
than the kurtosis of the features themselves. In our experiments, the top ten features had average 
kurtoses as high as 304. Experiments using only features with the lowest kurtoses resulted in poorer 
performance. 
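To make the kurtosis comparison concrete, the following sketch computes the (non-excess) sample kurtosis, for which a gaussian gives a value of 3. The synthetic "sparse" variable, which is zero most of the time and occasionally large, is a hypothetical stand-in for a selective feature and is not data from the paper.

```python
import random


def kurtosis(values):
    """Non-excess sample kurtosis: fourth central moment over squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)


rng = random.Random(0)  # fixed seed so the demonstration is repeatable

# A gaussian variable: its kurtosis should come out close to 3.
gaussian = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# A hypothetical sparse variable: nonzero for roughly 5% of the samples.
sparse = [rng.gauss(0.0, 1.0) if rng.random() < 0.05 else 0.0
          for _ in range(20000)]

print("gaussian kurtosis:", kurtosis(gaussian))
print("sparse kurtosis:  ", kurtosis(sparse))
```

A variable that responds strongly to only a few inputs has heavy tails relative to its variance, which is why high kurtosis is read here as a sign of selectivity.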

3 The Query Learning Process 

At first it might seem that the introduction of tens of thousands of features could only make the query
learning process infeasible. How can a problem which is difficult given ten to twenty features become
tractable with 10,000? Two recent results in machine learning argue that this is not necessarily
a terrible mistake: "support vector machines (SVM)" and "boosting" [2, 4]. Both approaches have
been shown to generalize well in very high dimensional spaces because they maximize the margin
between positive and negative examples. Boosting provides the closer fit to our problem because it
greedily selects a small number of features from a very large number of potential features.

Footnote 1: It is not unusual to observe high kurtosis in the distribution of a non-linear feature. For example one could easily square a
variable with gaussian distribution in order to yield a higher kurtosis. The complex features do not contain these sorts of
non-linearities. At each level the absolute value of the feature map is computed.

In its original form the AdaBoost learning algorithm is used to combine a collection of weak 
classifiers to form a stronger classifier. The task of a weak learner is to search over a very large 
set of simple classification functions to find one with low error. The learner is called weak because 
we only expect that the returned classifier will correctly classify slightly more than one half of the 
examples. In order for the weak learner to be boosted, it is called upon to solve a sequence of 
learning problems. In each subsequent problem examples are reweighted in order to emphasize 
those which are incorrectly classified. The final strong classifier is a weighted combination of weak 
classifiers. 

The weak learner used in the image query domain attempts to select the single complex feature 
along which the positive examples are most distinct from the negative examples. For each feature, 
the weak learner computes a gaussian model for the positives and negatives, and returns the feature 
for which the two class gaussian model is most effective. In practice no single feature can perform 
the classification task with 100% accuracy. Subsequent weak learners are forced to focus on the 
remaining errors through example re-weighting. The AdaBoost algorithm re-weights the incorrectly
classified examples in the following way. Define the classification error rate for the $k$th weak learner
as $\eta_k$, the initial example weights as $w_i$, and define $\beta_k = \eta_k / (1 - \eta_k)$. The new weights are
$w_i \leftarrow w_i \beta_k^{1 - e_i}$, where $e_i$ is 1 if the $i$th example is in error and 0 otherwise.
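The weak learner and the re-weighting step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: labels are taken to be +1 and -1, each weak classifier compares two weighted one-dimensional gaussians on a single feature, the variance floor of `1e-6` is arbitrary, and the vote weight `log(1 / beta)` follows the standard AdaBoost formulation rather than anything stated in this section. All function names are hypothetical.

```python
import math


def fit_gaussian(values, weights):
    """Weighted mean and variance, with a small floor on the variance."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, max(var, 1e-6)


def log_gauss(x, mean, var):
    """Log density of a one-dimensional gaussian."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def make_classifier(f, pos, neg):
    """Return h(x) = +1 when the positive gaussian is more likely at x[f]."""
    def h(x):
        return 1 if log_gauss(x[f], *pos) > log_gauss(x[f], *neg) else -1
    return h


def weak_learner(X, y, w):
    """Pick the single feature whose two-gaussian model has lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        pos = fit_gaussian([x[f] for x, t in zip(X, y) if t == 1],
                           [wi for wi, t in zip(w, y) if t == 1])
        neg = fit_gaussian([x[f] for x, t in zip(X, y) if t == -1],
                           [wi for wi, t in zip(w, y) if t == -1])
        h = make_classifier(f, pos, neg)
        err = sum(wi for x, t, wi in zip(X, y, w) if h(x) != t)
        if best is None or err < best[0]:
            best = (err, f, h)
    return best


def adaboost(X, y, rounds=30):
    """Return a list of (vote weight, feature index, weak classifier)."""
    w = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        err, f, h = weak_learner(X, y, w)
        if err >= 0.5:          # no better than chance: stop early
            break
        err = max(err, 1e-10)   # avoid beta = 0 when a feature is perfect
        beta = err / (1.0 - err)
        # w_i <- w_i * beta^(1 - e_i): correct examples shrink, errors keep weight
        w = [wi * (beta if h(x) == t else 1.0) for x, t, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((math.log(1.0 / beta), f, h))
    return ensemble


def margin(ensemble, x):
    """Signed score of the strong classifier; larger means more confidently positive."""
    return sum(alpha * h(x) for alpha, _, h in ensemble)
```

Because $\beta_k < 1$ whenever the error rate is below one half, correctly classified examples lose weight and, after normalization, the misclassified ones dominate the next round.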

3.1 Learning from User Input 

The user defines a new query in an interactive fashion. The first step is to select two or three positive 
examples. Users found that it was somewhat tedious to hand pick negative examples. Instead we 
randomly choose 100 images to form a set of generic negative examples. This policy for selecting 
negatives is somewhat risky because it is possible that the set may contain true positives. We run 
AdaBoost for 30 iterations which is usually sufficient to achieve zero error on the training set. Each 
image in the database is then ranked by the margin of the strong classifier. The first goal of an image 
retrieval program is to present the user with useful images which are related to the query. Since the 
learning algorithm is most certain about images with a large positive margin, a set of these images
is presented. Without further refinement this set of images often contains many false positives.

Retrieval results can be improved greatly if the user is given the opportunity to select new training 
examples. Toward this end three sets of images are presented: i) test set images with large positive 
margin; ii) generic negative images which are close to the decision boundary; and iii) test set images 
which are close to the boundary. The first set is intended to allow the user to select new negative 
training examples which are currently labelled as strongly positive. The second set allows the user 
to discard generic negatives which are not true negatives. The third set allows the user to refine the 
decision boundary by labelling examples which determine the margin. 
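The selection of the three feedback sets can be sketched as below. The sketch assumes that margins have already been computed for every image, that "close to the decision boundary" means a small absolute margin, and that each set holds `n` images. The function name, the dictionary inputs and the default `n = 20` are illustrative choices, not details given in the text.

```python
def feedback_sets(test_margins, negative_margins, n=20):
    """Pick the three sets of images shown to the user.

    test_margins and negative_margins map an image id to its margin
    under the current strong classifier.
    """
    # i) test set images with large positive margin
    strongly_positive = sorted(test_margins, key=test_margins.get,
                               reverse=True)[:n]
    # ii) generic negatives closest to the decision boundary
    borderline_negatives = sorted(negative_margins,
                                  key=lambda i: abs(negative_margins[i]))[:n]
    # iii) test set images closest to the decision boundary
    borderline_test = sorted(test_margins,
                             key=lambda i: abs(test_margins[i]))[:n]
    return strongly_positive, borderline_negatives, borderline_test
```

Any image the user relabels from these sets would be added to the training examples before AdaBoost is run again.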

In every case the final query is produced by running AdaBoost for 30 iterations. This yields 
a strong classifier which is a simple function of 30 complex features. Since image databases are 
very large, the computational complexity of the final classifier is a critical aspect of image database 
retrieval performance. 



4 Experiments 

Experimental verification of image database retrieval programs is a very difficult task. There are few 
if any standard datasets, and there are no widely agreed upon evaluation metrics. 

To test the classification performance of the system we constructed five classes of natural images 
(sunsets, lakes, waterfalls, fields, and mountains) using the Corel Stock Photo 2 image sets 1, 26, 27, 
28, and 114 respectively [1, 10]. Each class contains 100 images. We also used sets 1 through 30 
for a 3000 image data set to test retrieval performance. 

Figures 1 and 2 show results of queries for race cars, flowers, and waterfalls in the 3000 image 
data set. 

Figure 3 shows the average recall and precision for the five classes of natural images, each over a 
set of 100 random queries. Recall is defined as the ratio of the number of relevant images returned to 
the total number of relevant images. Precision is the ratio of the number of relevant images returned 
to the total number of images returned. 
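The two definitions translate directly into code. In this small sketch the function name, the ranked-list input and the `cutoff` parameter (how many top-ranked images count as "returned") are assumptions for illustration.

```python
def precision_recall(ranked_ids, relevant_ids, cutoff):
    """Precision and recall for the first `cutoff` images of a ranked list."""
    relevant = set(relevant_ids)
    returned = ranked_ids[:cutoff]
    hits = sum(1 for image_id in returned if image_id in relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Sweeping `cutoff` from 1 to the size of the database traces out a recall and precision curve for a single query; averaging over queries gives curves of the kind reported in Figure 3.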

5 Conclusion 

We have presented a framework for image database retrieval based on representing images with a 
very large set of highly-selective, complex features and interactively learning queries with a simple 
Boosting algorithm. The selectivity of the features allows effective queries to be formulated using
just a small set of features. This supports our observation of the "sparse" causal structure of images. 
It also makes training the classifier simple, and retrieval on a large database fast. 

Figure 1: Race cars: The top portion shows the positive examples followed by the top twenty
retrieved images. The first row of the bottom portion lists the negative images in the training set 
which are close to the decision boundary and the second row lists images in the test set which are 
near the boundary. 

Figure 2: Flowers and waterfalls: The positive examples are shown first followed by the top twenty
retrieved images. 

Figure 3: Average recall and precision for the five classes of natural images.